{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 概述\n",
    "\n",
    "Pandas 使我们可以以一种所有人都能理解的方式来处理数据。它可以让我们毫不费力地从诸如 csv 类型的文件中导入数据。我们可以用它快速地对数据进行复杂的转换和过滤等操作。Pandas 真是超级棒。\n",
    "\n",
    "Pandas和 Numpy、Matplotlib(可视化) 一起构成了一个 Python 数据探索和分析的强大基础。三者是 Python 科学计算真正的顶梁柱。\n",
    "\n",
    "### 导入 Pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "这是导入 pandas 的标准方法。我们不想一直写 pandas 的全名，但是保证代码的简洁和避免命名冲突都很重要，所以折中使用 pd 。如果你去看别人使用 pandas 的代码，就会看到这种导入方式"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Pandas 中的数据类型\n",
    "\n",
    "Pandas 基于两种数据类型，series 和 dataframe。\n",
    "\n",
    "series 是一种一维的数据类型，其中的每个元素都有各自的标签。如果你之前看过这个系列关于 Numpy 的推文，你可以把它当作一个由带标签的元素组成的 numpy 数组。标签可以是数字或者字符。\n",
    "\n",
    "dataframe 是一个二维的、表格型的数据结构。Pandas 的 dataframe 可以储存许多不同类型的数据，并且每个轴都有标签。你可以把它当作一个 series 的字典。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  创建对象\n",
    "创建一个Series对象：传递一个数值列表作为参数，令Pandas使用默认索引。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s = pd.Series([1,3,5,np.nan,6,8])\n",
    "print(s)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "创建一个DataFrame对象：传递待datetime索引和列标签的Numpy数组作为参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',\n",
      "               '2018-01-05', '2018-01-06'],\n",
      "              dtype='datetime64[ns]', freq='D')\n",
      "                   A         B         C         D\n",
      "2018-01-01  0.196674  0.219405 -1.697673 -1.137805\n",
      "2018-01-02 -0.424771  0.521244 -0.300820  0.895632\n",
      "2018-01-03  0.722108 -1.681273  0.854681  0.018483\n",
      "2018-01-04  0.398378 -0.385068  1.428257 -0.278602\n",
      "2018-01-05 -1.347804  0.564056  0.676420  0.609848\n",
      "2018-01-06  0.809219  1.486064 -0.056485 -0.887798\n"
     ]
    }
   ],
   "source": [
    "# 首先创建一个时间序列\n",
    "dates = pd.date_range('20180101', periods=6)\n",
    "print(dates)\n",
    "\n",
    "# 创建DataFrame对象，指定index和columns标签\n",
    "df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 也可以传递词典来构建DataFrame，该词典的元素是形如Series的对象\n",
    "df2 = pd.DataFrame({\n",
    "    'A' : 1.5,\n",
    "    'B' : pd.Timestamp('20180102'),\n",
    "    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),\n",
    "    'D' : np.array([3] * 4,dtype='int32'),\n",
    "    'E' : pd.Categorical([\"test\",\"train\",\"test\",\"train\"]),\n",
    "    'F' : 'foo' })\n",
    "\n",
    "print(df2)\n",
    "\n",
    "# 查看df2对象各列的数据类型\n",
    "print(df2.dtypes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "a = df.iloc[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[0.19667411660881418,\n",
       " 0.21940531171401007,\n",
       " -1.6976726311172907,\n",
       " -1.1378050000381998]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a.to_list()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'A': {Timestamp('2018-01-01 00:00:00', freq='D'): 0.19667411660881418,\n",
       "  Timestamp('2018-01-02 00:00:00', freq='D'): -0.42477082213568673,\n",
       "  Timestamp('2018-01-03 00:00:00', freq='D'): 0.7221084785754129,\n",
       "  Timestamp('2018-01-04 00:00:00', freq='D'): 0.3983783820831922,\n",
       "  Timestamp('2018-01-05 00:00:00', freq='D'): -1.3478042722441395,\n",
       "  Timestamp('2018-01-06 00:00:00', freq='D'): 0.8092185648629963},\n",
       " 'B': {Timestamp('2018-01-01 00:00:00', freq='D'): 0.21940531171401007,\n",
       "  Timestamp('2018-01-02 00:00:00', freq='D'): 0.5212440746383651,\n",
       "  Timestamp('2018-01-03 00:00:00', freq='D'): -1.6812727280622246,\n",
       "  Timestamp('2018-01-04 00:00:00', freq='D'): -0.38506843694473264,\n",
       "  Timestamp('2018-01-05 00:00:00', freq='D'): 0.5640558212431627,\n",
       "  Timestamp('2018-01-06 00:00:00', freq='D'): 1.4860642674766138},\n",
       " 'C': {Timestamp('2018-01-01 00:00:00', freq='D'): -1.6976726311172907,\n",
       "  Timestamp('2018-01-02 00:00:00', freq='D'): -0.30081954479420286,\n",
       "  Timestamp('2018-01-03 00:00:00', freq='D'): 0.8546809491047423,\n",
       "  Timestamp('2018-01-04 00:00:00', freq='D'): 1.42825664215245,\n",
       "  Timestamp('2018-01-05 00:00:00', freq='D'): 0.6764196945469022,\n",
       "  Timestamp('2018-01-06 00:00:00', freq='D'): -0.056485480740064306},\n",
       " 'D': {Timestamp('2018-01-01 00:00:00', freq='D'): -1.1378050000381998,\n",
       "  Timestamp('2018-01-02 00:00:00', freq='D'): 0.895631741458672,\n",
       "  Timestamp('2018-01-03 00:00:00', freq='D'): 0.01848315154626709,\n",
       "  Timestamp('2018-01-04 00:00:00', freq='D'): -0.278601577864584,\n",
       "  Timestamp('2018-01-05 00:00:00', freq='D'): 0.6098476600439467,\n",
       "  Timestamp('2018-01-06 00:00:00', freq='D'): -0.8877983580042081}}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "res = []\n",
    "for i in range(len(df)):\n",
    "    res.append(df.iloc[i].to_list())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#查看一个DataFrame对象的前几行和最后几行\n",
    "print(df.head(1)) #默认是5项\n",
    "print(\"-----------------------------------------------\")\n",
    "print(df.tail(3))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# 其他基本信息\n",
    "# 索引\n",
    "print(df.index)\n",
    "print(\"-----------------------------------------------\")\n",
    "# 列标签\n",
    "print(df.columns)\n",
    "print(\"-----------------------------------------------\")\n",
    "# 数值\n",
    "print(df.values)\n",
    "print(\"-----------------------------------------------\")\n",
    "# 统计\n",
    "print(df.describe())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 转置与排序\n",
    "\n",
    "# 转置\n",
    "print(df.T)\n",
    "print(\"-----------------------------------------------\")\n",
    "# 按轴排序，逐列递减\n",
    "print(df.sort_index(axis=1, ascending=False))\n",
    "print(\"-----------------------------------------------\")\n",
    "# 按值排序，'B'列逐行递增\n",
    "print(df.sort_values(by='B'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 选中\n",
    "尽管标准Python/Numpy的选中表达式很直观也很适合互动，但在生产代码中还是推荐使用Pandas里的方法如.at,.iat,.loc,.iloc,.ix等。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 分割数据\n",
    "print(df[\"A\"]) # 与df.A相同\n",
    "print(\"-----------------------------------------------\")\n",
    "# 用[]分割DataFrame\n",
    "print(df[0:3])\n",
    "print(\"-----------------------------------------------\")\n",
    "print(df['20180102':'20180104'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 选中一整行\n",
    "print(df.loc[dates[0]])\n",
    "print(\"-----------------------------------------------\")\n",
    "# 按标签选中复数列（所有行，输出只显示前5行）\n",
    "print(df.loc[:,['A','B']])\n",
    "print(\"-----------------------------------------------\")\n",
    "# 行/列同时划分（包括起止点）\n",
    "print(df.loc['20180102':'20180104',['A','B']])\n",
    "print(\"-----------------------------------------------\")\n",
    "# 返回一个元素（两个方法等效）\n",
    "print(df.loc[dates[0],'A'])\n",
    "print(\"-----------------------------------------------\")\n",
    "print(df.at[dates[0],'A'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 位置索引为3的行（从0开始，所以其实是第4行）\n",
    "print(df.iloc[3])\n",
    "print(\"-----------------------------------------------\")\n",
    "# 按位置索引分割DataFrame\n",
    "print(df.iloc[3:5,0:2])\n",
    "print(\"-----------------------------------------------\")\n",
    "print(df.iloc[[1,2,4],[0,2]])\n",
    "print(\"-----------------------------------------------\")\n",
    "\n",
    "# 直接定位一个特定元素\n",
    "print(df.iloc[1,1])\n",
    "print(\"-----------------------------------------------\")\n",
    "print(df.iat[1,1])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 布尔值索引"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 用一列的值来选择数据\n",
    "print(df.A > 0)\n",
    "print(\"-----------------------------------------------\")\n",
    "# 使用.isin()函数过滤数据\n",
    "df2 = df.copy()\n",
    "df2['E'] = ['one', 'one','two','three','four','three']\n",
    "# 提取df2中'E'值属于['two', 'four']的行\n",
    "print(df2[df2['E'].isin(['two','four'])])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 为DataFrame创建一个新的列，其值为时间顺序（与df相同）的索引值\n",
    "s1 = pd.Series([1,2,3,4,5,6], index=pd.date_range('20180101', periods=6))\n",
    "print(s1)\n",
    "\n",
    "df['F'] = s1\n",
    "\n",
    "# 按标签赋值\n",
    "df.at[dates[0],'A'] = 0\n",
    "\n",
    "# 按索引赋值\n",
    "df.iat[0,1] = 0\n",
    "\n",
    "# 用Numpy数组赋值\n",
    "df.loc[:,'D'] = np.array([5] * len(df))\n",
    "print(\"-----------------------------------------------\")\n",
    "# 最终结果\n",
    "print(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 缺失数据\n",
    "Pandas默认使用np.nan来代表缺失数据。Reindexing允许用户对某一轴上的索引改/增/删，并返回数据的副本。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 创建DataFrame对象df1，以dates[0:4]为索引，\n",
    "# 在df的基础上再加一个新的列'E'（初始均为NaN）\n",
    "df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])\n",
    "print(df1)\n",
    "print(\"-----------------------------------------------\")\n",
    "# 将'E'列的前两个行设为1\n",
    "df1.loc[dates[0]:dates[1],'E'] = 1\n",
    "\n",
    "df1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 处理缺失数据\n",
    "# 剔除df1中含NaN的行（只要任一一列为NaN就算）\n",
    "df2 = df1.dropna(how='any')\n",
    "print(df2)\n",
    "print(\"--------------------------------------\")\n",
    "# 用5填充df1里的缺失值\n",
    "df2 = df1.fillna(value=5)\n",
    "print(df2)\n",
    "print(\"--------------------------------------\")\n",
    "# 判断df2中的值是否为缺失数据，返回True/False\n",
    "pd.isnull(df2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 练习"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "\n",
    "import pandas as pd\n",
    "\n",
    "# 奥林匹克运动会奖品数据\n",
    "df = pd.read_csv('datas/olympics.csv', index_col=0, skiprows=1)\n",
    "\n",
    "for col in df.columns:\n",
    "    if col[:2]=='01':\n",
    "        df.rename(columns={col:'Gold'+col[4:]}, inplace=True)\n",
    "    if col[:2]=='02':\n",
    "        df.rename(columns={col:'Silver'+col[4:]}, inplace=True)\n",
    "    if col[:2]=='03':\n",
    "        df.rename(columns={col:'Bronze'+col[4:]}, inplace=True)\n",
    "    if col[:1]=='№':\n",
    "        df.rename(columns={col:'#'+col[1:]}, inplace=True)\n",
    "\n",
    "names_ids = df.index.str.split('\\s\\(') # 通过'('　分解 index \n",
    "\n",
    "df.index = names_ids.str[0] # [0] 元素是国家 (作为新的index) \n",
    "df['ID'] = names_ids.str[1].str[:3] # [1] 元素是ID (前三个字母)\n",
    "\n",
    "df = df.drop('Totals')\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 问题１\n",
    "# 数据里面第一个国家是什么，需要返回Series(正行数据)\n",
    "df.iloc[0].name"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 问题2\n",
    "# 哪个国家在夏季奥运会上获得了最多的金牌？需要返回国家名字\n",
    "x = max(df['Gold'])\n",
    "print(x)\n",
    "ans = df[df['Gold'] == x].index.tolist()\n",
    "print(ans[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 问题3\n",
    "# 哪个国家的夏季和冬季金牌数量差异最大？需要返回国家名字\n",
    "x = max(df['Gold'] - df['Gold.1'])\n",
    "ans = df[(df['Gold'] - df['Gold.1']) == x].index.tolist()\n",
    "print(ans[0])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 问题４\n",
    "#　哪个国家的夏季和冬季金牌数量相差比例最大？ 仅包括夏季和冬季至少赢得1金的国家。\n",
    "df_gold = df[(df['Gold']>0) & (df['Gold.1']>0)]\n",
    "df_max_diff = (abs(df_gold['Gold']-    df_gold['Gold.1'])/df_gold['Gold.2'])\n",
    "print(df_max_diff.idxmax())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 操作\n",
    "\n",
    "此类操作默认排除缺失数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 重新创建一份数据\n",
    "dates = pd.date_range('20180101', periods=6)\n",
    "df = pd.DataFrame(np.ones((6,4)), index=dates, columns=list('ABCD'))\n",
    "s = pd.Series([2,2,2,2,2,2], index=dates)\n",
    "df['E'] = s\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 求平均值\n",
    "print(df.mean())\n",
    "print(\"------\")\n",
    "\n",
    "# 一行求平均值\n",
    "print(df.mean(1))\n",
    "print(\"------\")\n",
    "\n",
    "# 创建Series对象s，以dates为索引并平移2个位置\n",
    "s = pd.Series([1,3,5,np.nan,6,8], index=dates).shift(2)\n",
    "print(s)\n",
    "print(\"------\")\n",
    "\n",
    "# 从df中逐列减去s（若有NaN则得NaN）\n",
    "print(df.sub(s, axis='index'))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 逐行累加\n",
    "print(df.apply(np.cumsum))\n",
    "print(\"------\")\n",
    "\n",
    "# 每列的最大值减最小值\n",
    "print(df.apply(lambda x: x.max() - x.min()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 字符\n",
    "# Series对象的str属性具有一系列字符处理方法，可以很轻松地操作数组的每个元素。\n",
    "s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])\n",
    "print(s.str.lower())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 合并\n",
    "\n",
    "Pandas在join/merge两中情境下提供了支持多种方式，基于逻辑/集合运算和代数运算来连接Series，DataFrame和Panel对象。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# concat()方法连接数组\n",
    "df = pd.DataFrame(np.random.randn(8, 4))\n",
    "print(df)\n",
    "print(\"------\")\n",
    "\n",
    "# 拆分成块\n",
    "pieces = [df[:2], df[3:5], df[6:]]\n",
    "\n",
    "# 重新连接，可得初始数组\n",
    "print(pd.concat(pieces))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 增补"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])\n",
    "print(df)\n",
    "print(\"------\")\n",
    "\n",
    "# 将索引为3的行增补到整个DataFrame最后\n",
    "s = df.iloc[3]\n",
    "print(df.append(s, ignore_index=True))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 组合"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 新建DataFrame对象df\n",
    "df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'], 'B' : ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'], 'C' : np.random.randn(8), 'D' : np.random.randn(8)})\n",
    "\n",
    "print(df)\n",
    "print(\"------\")\n",
    "\n",
    "\n",
    "# 对'A'列进行合并并应用.sum()函数\n",
    "print(df.groupby('A').sum())\n",
    "print(\"------\")\n",
    "\n",
    "\n",
    "# 对'A', 'B'两列分别合并形成层级结构，再应用.sum()函数\n",
    "print(df.groupby(['A','B']).sum())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 重塑"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tuples = list(zip(*[\n",
    "    ['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],\n",
    "    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))\n",
    "\n",
    "# 多重索引\n",
    "index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])\n",
    "\n",
    "df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])\n",
    "df2 = df[:4]\n",
    "\n",
    "print(df2)\n",
    "\n",
    "print(\"------\")\n",
    "\n",
    "# .stack()方法将DataFrame的列“压缩”了一级\n",
    "stacked = df2.stack()\n",
    "\n",
    "print(stacked)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 对于已经层次化的，具有多重索引的DataFrame或Series，\n",
    "# stack()的逆操作是unstack()——默认将最后一级“去层次化”。\n",
    "print(stacked.unstack())\n",
    "\n",
    "print(\"------\")\n",
    "\n",
    "print(stacked.unstack(1))\n",
    "\n",
    "print(\"------\")\n",
    "\n",
    "print(stacked.unstack(0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 数据透视表"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\n",
    "    'A' : ['one', 'one', 'two', 'three'] * 3,\n",
    "    'B' : ['A', 'B', 'C'] * 4,\n",
    "    'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,\n",
    "    'D' : np.random.randn(12), 'E' : np.random.randn(12)\n",
    "})\n",
    "\n",
    "print(df)\n",
    "print(\"------\")\n",
    "\n",
    "# 生成数据透视表\n",
    "print(pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C']))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 时间序列"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "rng = pd.date_range('3/6/2018 00:00', periods=5, freq='D')\n",
    "ts = pd.Series(np.random.randn(len(rng)), rng)\n",
    "print(ts)\n",
    "print(\"------\")\n",
    "\n",
    "# 设定国际时区标准\n",
    "ts_utc = ts.tz_localize('UTC')\n",
    "print(ts_utc)\n",
    "\n",
    "print(\"------\")\n",
    "\n",
    "# 切换时区\n",
    "print(ts_utc.tz_convert('US/Eastern'))\n",
    "print(ts_utc.tz_convert('Hongkong'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 分类"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.DataFrame({\"id\":[1,2,3,4,5,6], \"raw_grade\":['a', 'b', 'b', 'a', 'a', 'e']})\n",
    "# 将原始记录转换为分类类型\n",
    "df[\"grade\"] = df[\"raw_grade\"].astype(\"category\")\n",
    "\n",
    "print(df[\"grade\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df[\"grade\"].cat.categories = [\"very good\", \"good\", \"very bad\"]\n",
    "\n",
    "# 重排类别同时添加上缺失的类别名称\n",
    "df[\"grade\"] = df[\"grade\"].cat.set_categories([\"very bad\", \"bad\", \"medium\", \"good\", \"very good\"])\n",
    "\n",
    "print(df[\"grade\"])\n",
    "print(\"------\")\n",
    "\n",
    "# 排序在各分类中分别进行\n",
    "print(df.sort_values(by=\"grade\"))\n",
    "print(\"------\")\n",
    "\n",
    "# 对类别列分组可以显示空类\n",
    "print(df.groupby(\"grade\").size())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pandas功能非常多，如果要熟练使用，多看文档，https://apachecn.github.io/pandas-doc-zh/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 练习"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# 哪个州的县最多？\n",
    "\n",
    "# 读入数据\n",
    "wz_df = pd.read_csv('datas/waizi_v2.csv')\n",
    "wz_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 问题１　把time换成Datetime类型\n",
    "print(wz_df.dtypes)\n",
    " \n",
    "wz_df['time'] = pd.to_datetime(wz_df['time'], format='%Y/%m/%d', errors='ignore')\n",
    "print('--------------------')\n",
    "print(wz_df.dtypes)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 问题２　根据type求price的和\n",
    "wz_df.groupby('type').sum()['price']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from datetime import datetime "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 问题３　根据时间，找出２０１７年１－６月的实际外资金额前３\n",
    "date = datetime.strptime(\"2017/06/01\", \"%Y/%m/%d\")\n",
    "wz_df= wz_df[wz_df['type']=='实际外资金额']\n",
    "wz_df[wz_df['time']<date].sort_values(by='price',ascending=False).head(3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 问题4 新加一列，把美元(price)换成人民币价格(*6.5倍)\n",
    "def func(data):\n",
    "    data['RMB'] = data['price'] * 6.5 \n",
    "    return data\n",
    "\n",
    "wz_df.apply(func,axis=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#问题4 第二种写法\n",
    "wz_df['RMB'] = wz_df['price']*6.5\n",
    "wz_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#问题5 保持结果为csv,文件名：result.csv\n",
    "wz_df.to_csv(\"result.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
